In [1]:
using DataFrames
import JSON  # needed later for the JSON.json calls
In [9]:
df = readtable("data.csv");
In [10]:
size(df)
Out[10]:
In [11]:
names(df)
Out[11]:
In [12]:
df[:timers_t_done]
Out[12]:
In [13]:
df[30:40, :timers_t_done]
Out[13]:
In [14]:
df[30:40, [:timestamp, :geo_cc, :geo_netspeed, :user_agent_family, :timers_t_done]]
Out[14]:
Most Julia stats functions operate on AbstractArray, the supertype of both Array and DataArray, so you can run them on any column of a DataFrame that contains numbers. If the column contains NA values, remove them first with the dropna function.
Our test dataset doesn't contain any NA values for the timers_t_done column, so we're safe.
In [15]:
summarystats(df[:timers_t_done])
Out[15]:
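For readers on a current Julia 1.x release, where NA and dropna have been replaced by missing and skipmissing, the same idea can be sketched with the Statistics standard library alone. The timings vector below is made-up data, not a column from data.csv.

```julia
using Statistics  # standard library: mean, median

# Hypothetical column with one missing entry (assumption: Julia 1.x, `missing` instead of NA)
timings = [1200, 850, missing, 4300, 975]

# skipmissing drops the missing entries; collect turns the result into a plain Vector
clean = collect(skipmissing(timings))

println("n      = ", length(clean))
println("min    = ", minimum(clean))
println("median = ", median(clean))
println("mean   = ", mean(clean))
println("max    = ", maximum(clean))
```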
By default, the hist function splits the dataset into equal-sized buckets based on the data's range. This may not always be what you want, so you can pass in a list of thresholds (bucket edges) as the second parameter.
The hist function returns a tuple. The first element holds the thresholds used, which might be a Range object or an Array. The second element is the list of bucket frequencies, with one entry fewer than the number of thresholds.
In [16]:
hist(df[:timers_t_done])
Out[16]:
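To make the relationship between thresholds and frequencies concrete, here is a minimal sketch of bucket counting in plain Julia. bucket_counts is a hypothetical helper, not a library function, and the choice of buckets that exclude their lower edge and include their upper edge is an assumption made for this sketch.

```julia
# Count how many values fall into each bucket (edges[i], edges[i+1]].
# `bucket_counts` is a hypothetical helper written for illustration only.
function bucket_counts(xs, edges)
    counts = zeros(Int, length(edges) - 1)
    for x in xs
        for i in 1:length(edges)-1
            if edges[i] < x <= edges[i+1]
                counts[i] += 1
                break  # each value belongs to at most one bucket
            end
        end
    end
    return counts
end

edges = [0, 2, 4, 6]
counts = bucket_counts([1, 2, 3, 4, 5, 7], edges)
println(counts)  # the value 7 lies beyond the last edge and is not counted
```

Four thresholds give three buckets, and any value outside the outermost thresholds is silently left out, which is why the choice of thresholds matters.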
We could use static thresholds, but they wouldn't adapt to different datasets. Instead, we write a Julia function that determines thresholds from the dataset itself.
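Rather than divide the entire range into a fixed set of buckets, we divide a range derived from the Inter-Quartile Range (IQR): from 1.5 times the IQR below the first quartile to 1.5 times the IQR above the third quartile, clamped to the observed minimum and maximum. This has the advantage of excluding outliers from the basic range. Values above the high bound then go into one extra bucket of their own that extends to the maximum. The function below adds no matching bucket under the low bound, so any values below it fall outside the thresholds.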
This is very similar to a box and whiskers plot.
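As a small worked example of the bounds described above, the following sketch computes them for a made-up vector using quantile from the Statistics standard library (assuming Julia 1.x). The variable names are chosen for illustration.

```julia
using Statistics  # standard library: quantile

# Made-up sample with one obvious outlier
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

q25, q75 = quantile(data, [0.25, 0.75])
fw = 1.5 * (q75 - q25)                 # fence width: 1.5 times the IQR

low = max(minimum(data), q25 - fw)     # clamped to the observed minimum
high = min(maximum(data), q75 + fw)    # clamped to the observed maximum

outliers = filter(x -> x < low || x > high, data)

println("q25 = ", q25, ", q75 = ", q75)
println("low = ", low, ", high = ", high)
println("outliers = ", outliers)
```

Here only the value 100 lies outside the bounds, so the regular buckets cover 1 through 13 instead of being stretched across 1 through 100.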
In [17]:
# Function to set histogram thresholds after dropping outliers based on IQR
function getSymmetricThresholds(results::DataFrame; timer::Symbol=:timers_t_done)
    stats = summarystats(results[timer])

    # Fence width: 1.5 times the inter-quartile range
    fw = (stats.q75 - stats.q25) * 1.5

    # Clamp the fences to the observed minimum and maximum
    low = round(Int64, max(stats.min, stats.q25 - fw))
    high = round(Int64, min(stats.max, stats.q75 + fw)) + 1

    # Split the span from low to high into nthresholds roughly equal-width buckets
    thresholds = Int64[]
    nthresholds = 25
    span = high - low
    for i in 0:nthresholds-1
        push!(thresholds, round(Int64, low + i * span / nthresholds))
    end
    push!(thresholds, high)

    # One extra bucket for outliers above the high fence
    if high < round(Int64, stats.max)
        push!(thresholds, round(Int64, stats.max))
    end

    return thresholds
end
Out[17]:
Notice that Julia functions are declared using the function keyword. Function parameters may have types attached to them. This is optional, and mainly useful when you define several methods of the same function for different argument types.
Functions may have keyword parameters with default values. A ; separates the positional parameters from the keyword ones.
Keyword parameters must be passed by name, and their order doesn't matter.
A function returns a single value, though that value may be a tuple of multiple objects. The caller can then receive the return value as a single tuple or unpack it into several variables separated by commas, optionally enclosed in ().
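The points above can be seen together in a short sketch. scaled_extrema is a made-up function for illustration and has nothing to do with the dataset.

```julia
# `xs` is a required positional parameter with a type annotation;
# `scale` is a keyword parameter with a default value.
function scaled_extrema(xs::Vector{Int}; scale::Int=1)
    # The single return value is a tuple of two numbers
    return (minimum(xs) * scale, maximum(xs) * scale)
end

# Receive the result as one tuple, using the default scale
both = scaled_extrema([3, 1, 2])

# Pass the keyword parameter by name and unpack the tuple into two variables
lo, hi = scaled_extrema([3, 1, 2]; scale=10)

println(both)
println(lo, " ", hi)
```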
In [18]:
thresholds = getSymmetricThresholds(df)
Out[18]:
Running the hist function with our new thresholds gives us a much finer-grained view of the data.
In [19]:
hist_global = hist(df[:timers_t_done], thresholds)[2]
Out[19]:
In [22]:
results_US = df[!isna(df[:geo_cc]) & (df[:geo_cc] .== "US"), :];
In [23]:
hist_US = hist(results_US[:timers_t_done], thresholds)[2]
Out[23]:
In [24]:
cor(hist_global, hist_US)
Out[24]:
We could also run cumsum on each histogram to get cumulative counts (an unnormalized CDF) and correlate those values.
In [25]:
cor(cumsum(hist_global), cumsum(hist_US))
Out[25]:
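The cumulative counts do not need to be normalized before correlating them, because correlation is unaffected by a positive rescaling of either input. A minimal sketch with made-up bucket counts, assuming Julia 1.x and the Statistics standard library:

```julia
using Statistics  # standard library: cor

# Made-up bucket counts for two groups
counts_a = [2, 3, 5]
counts_b = [1, 4, 5]

# Dividing the cumulative counts by the total gives an empirical CDF ending at 1.0
cdf_a = cumsum(counts_a) ./ sum(counts_a)
cdf_b = cumsum(counts_b) ./ sum(counts_b)

r_counts = cor(cumsum(counts_a), cumsum(counts_b))
r_cdf = cor(cdf_a, cdf_b)

println(cdf_a)
println(cdf_b)
println(isapprox(r_counts, r_cdf))  # the two correlations agree
```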
In [26]:
by(df, :user_agent_family, rows -> median(rows[:timers_t_done]))
Out[26]:
If the aggregation function returns an array, like the hist function does, then we'll actually end up with one row per array element. Instead we need to serialize the array to a string or create a custom data type that encapsulates the array. The string method is easier albeit a little slower, but if we're going to export our data to JavaScript, we may need to do this anyway.
In [27]:
by(
    df,
    :user_agent_family,
    rows -> DataFrame(
        count = size(rows, 1),
        median = median(rows[:timers_t_done]),
        hist = JSON.json(hist(rows[:timers_t_done], thresholds)[2])
    )
)
Out[27]:
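The idea of keeping one row per group by turning each array into a string can be sketched without any packages. The observations below are made up, and to_json_array is a hypothetical stand-in for JSON.json that only handles vectors of numbers.

```julia
using Statistics  # standard library: median

# Made-up (user agent family, timing) observations
observations = [("Chrome", 900), ("Firefox", 1200), ("Chrome", 1100),
                ("Firefox", 1000), ("Chrome", 1000)]

# Group the timings by family
groups = Dict{String, Vector{Int}}()
for (family, t) in observations
    push!(get!(groups, family, Int[]), t)
end

# Hypothetical stand-in for JSON.json, valid for vectors of numbers only
to_json_array(xs) = "[" * join(xs, ",") * "]"

# One output line per group, with the array packed into a single string
for family in sort(collect(keys(groups)))
    ts = groups[family]
    println(family, "  count=", length(ts), "  median=", median(ts),
            "  values=", to_json_array(ts))
end
```

Because the array travels as one string, each group stays on a single row, and the string can be parsed back into an array on the JavaScript side.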
In [28]:
println("Histogram:\n", JSON.json(hist_global))
println()
println("Thresholds:\n", JSON.json(thresholds))